Implementation Experiences in Transparently Harnessing Cluster-Wide Memory
Abstract
There is a constant battle to break even between continuing improvements in DRAM capacities and the growing memory demands of large-memory high-performance applications. Performance of such applications degrades quickly once the system hits the physical memory limit and starts swapping to the local disk. In this paper, we investigate the benefits and tradeoffs of pooling together the collective memory resources of nodes across a high-speed LAN-based cluster. We present the design, implementation and evaluation of Anemone (an Adaptive Network Memory Engine), which virtualizes the collective unused memory of multiple machines across a gigabit Ethernet LAN without requiring any modifications to the large-memory applications. We have implemented a working prototype of Anemone and evaluated it using real-world unmodified applications such as ray-tracing and large in-memory sorting. Our results with the Anemone prototype show that unmodified single-process applications execute 2 to 3 times faster, and multiple concurrent processes execute 6 to 7.7 times faster, when compared to disk-based paging. The Anemone prototype reduces page-fault latencies by a factor of 19.6, from an average of 9.8 ms with disk-based paging to 500 μs with Anemone. Most importantly, Anemone provides virtualized low-latency access to potentially “unlimited” memory resources across the network.
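The latency-reduction factor quoted in the abstract follows directly from the two reported averages. The short sketch below only reproduces that arithmetic; the constant and function names are illustrative choices, not identifiers from the Anemone prototype.

```python
# Average page-fault latencies as quoted in the abstract.
# The names below are hypothetical and used only for this illustration.
DISK_FAULT_LATENCY_S = 9.8e-3      # 9.8 ms with disk-based paging
ANEMONE_FAULT_LATENCY_S = 500e-6   # 500 microseconds with Anemone


def latency_reduction_factor(baseline_s: float, improved_s: float) -> float:
    """Return how many times smaller the improved latency is than the baseline."""
    if improved_s <= 0:
        raise ValueError("improved latency must be positive")
    return baseline_s / improved_s


factor = latency_reduction_factor(DISK_FAULT_LATENCY_S, ANEMONE_FAULT_LATENCY_S)
print(f"Page-fault latency reduction: {factor:.1f}x")
```

This factor describes per-fault latency only. The application-level speedups reported in the abstract (2 to 3 times for single processes, 6 to 7.7 times for concurrent processes) are separate measurements.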
Similar Resources
Anemone: Transparently Harnessing Cluster-Wide Memory
There is a constant battle to break even between continuing improvements in DRAM capacities and the growing memory demands of large-memory high-performance applications. Performance of such applications degrades quickly once the system hits the physical memory limit and starts swapping to the local disk. We present the design, implementation and evaluation of Anemone – an Adaptive Network Memor...
Fast Transparent Cluster-Wide Paging
In a cluster with a very low-latency interconnect, the remote memory of nodes can serve as storage that is faster than local disk but slower than local memory. In this paper, we address the problem of transparently utilizing this cluster-wide pool of unused memory as a low-latency paging device. Such a transparent remote memory paging system can enable large-memory applications to benefit fro...
The implementation of the Cm* multi-microprocessor by RICHARD
The implementation of a hierarchical, packet-switched multiprocessor is presented. The lowest level of the structure, a Computer Module, is a processor-memory pair. Computer Modules are grouped to form a cluster; communication within the cluster is via a parallel bus controlled by a centralized address mapping processor. Clusters communicate via intercluster busses. A memory reference by a prog...
KIMP: Multicheckpointing Multiprocessors
Multiprocessors are coming into widespread use in many application areas, yet there are a number of challenges to achieving a good tradeoff between complexity and performance. For example, while implementing memory coherence and consistency is essential for correctness, efficient implementation of critical sections and synchronization points is desirable for performance. The multi-checkpointin...
OS Experimentation and a User Community Coexist Under the DUnX Kernel
The class of NUMA (nonuniform memory access time) shared memory architectures is becoming increasingly important with the desire for larger scale multiprocessors. In such machines, the placement and movement of code and data are crucial to performance. The operating system can play a role in managing placement through the policies and mechanisms of the virtual memory subsystem. An implementatio...
Publication date: 2006